Use flock instead of lsof command in run-command-shim to prevent run commands hanging due to lsof command hanging #70

Open
viveklingaiah wants to merge 5 commits into main from dev/vivekl/lsofAlt

@viveklingaiah viveklingaiah commented Dec 17, 2025

Contributor

Problem(s):

  • The lsof command hangs during execution of run-command-shim for 3P customers Chevron and CapGemini. The hang has caused timeouts in Run Command v1 Linux, even before the user's script starts executing.
    The ICM incidents are listed in Task 34627514: Investigate RCv2 Linux NOT using lsof command OR using alternatives (lsof command hanging)
  • "lsof command not found" errors occur when the lsof package is not installed.

Background:

  • The shim file of the extensions (Run Command v1, v2, Custom Script) currently uses the lsof command to prevent issues caused by race conditions.
  • The shim uses lsof to list the files opened by the run command binary, run-command-handler. If any files are already open, the shim does not run the script; it waits and retries.
  • lsof is performance-intensive: it lists processes and examines all file descriptors and symbolic links. It can get stuck when it examines file descriptors on an unavailable or unresponsive mount or network file system, or when name resolution of IPs hits DNS timeouts. This can cause hangs on our customers' heavily loaded servers.

Solution

  • Use the flock command to execute the script, instead of lsof.
  • flock acquires a lock on a file before executing the run command script, which prevents issues due to race conditions. (Note: enable double-forks the command, so flock does not block; it returns as soon as the background process that executes the script has started.)
  • Unlike lsof, which is often a standalone package, flock is part of util-linux, a core collection of essential system utilities that is almost always pre-installed on standard Linux systems. This also prevents errors such as "lsof command not found".
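
The locking behavior described above can be sketched as follows. This is a minimal illustration, not the shim's actual code: the lock file comes from `mktemp`, and the `sleep 3` holder is a hypothetical stand-in for a running handler. It relies on the documented util-linux behavior that `flock -n` exits with status 1 (the default conflict exit code) when the lock is already held.

```shell
#!/usr/bin/env bash
# Hypothetical lock file for this demo; the shim uses its own path.
LOCKFILE="$(mktemp)"

# First caller: take an exclusive lock and hold it for 3 seconds in the background.
flock -x -n "$LOCKFILE" -c 'sleep 3' &
holder_pid=$!
sleep 0.5   # give the holder time to acquire the lock

# Second caller: -n makes flock give up immediately instead of blocking.
contended_status=0
flock -x -n "$LOCKFILE" -c 'true' || contended_status=$?

# Wait for the holder to exit, which releases the lock.
wait "$holder_pid"

# Third caller: the lock is free again, so the command runs.
free_status=0
flock -x -n "$LOCKFILE" -c 'true' || free_status=$?

echo "while held: $contended_status, after release: $free_status"
rm -f "$LOCKFILE"
```

On a system with util-linux flock, the second attempt should report status 1 and the third status 0, which matches the `flock_status` checks in the shim's trace output.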

Sample execution from log:
2025-12-17T21:16:40.356264Z INFO ExtHandler [Microsoft.Azure.Extensions.Edp.RunCommandHandlerLinuxTest.RC1218_7-1.14.0] Command: bin/run-command-shim enable
[stdout]
flock -x -n ./run-command-handler.lock -c 'nohup /var/lib/waagent/Microsoft.Azure.Extensions.Edp.RunCommandHandlerLinuxTest-1.14.0/bin/run-command-handler enable &'
'+ flock_status=0
'+ '[' 0 -eq 1 ']'
'+ '[' 0 -eq 0 ']'
'+ LOCK_ACQUIRED=1
'+ echo 'Lock acquired on file ./run-command-handler.lock and executed command nohup /var/lib/waagent/Microsoft.Azure.Extensions.Edp.RunCommandHandlerLinuxTest-1.14.0/bin/run-command-handler enable & successfully. Exiting.'
Lock acquired on file ./run-command-handler.lock and executed command nohup /var/lib/waagent/Microsoft.Azure.Extensions.Edp.RunCommandHandlerLinuxTest-1.14.0/bin/run-command-handler enable & successfully. Exiting.
'+ break
'+ set +x

@viveklingaiah viveklingaiah changed the title from "Use flock instead of lsof command in run-command-shim to prevent run commands hanging due lsof command hanging" to "Use flock instead of lsof command in run-command-shim to prevent run commands hanging due to lsof command hanging" on Dec 17, 2025
Comment thread misc/run-command-shim
set -x
while (( retry_attempts < 10 )); do
# Acquire an exclusive (-x), non-blocking (-n) lock on the lock file and execute $commandToExecute (side note: flock is part of the util-linux package and is available by default on most Linux distros)
flock -x -n "$LOCKFILE" -c "$commandToExecute"

Collaborator

Add a condition to check whether flock is available on the host OS. If it's not available, I think we should either fall back to lsof, or forgo checking for the file lock altogether and let the error bubble up in the status file.
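
One way to express such a check, as a rough sketch only: `command -v` reports whether `flock` is on the PATH. The helper name `run_with_lock` and the choice to run the command without a lock when flock is missing are assumptions made for illustration; falling back to lsof would be the other option mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: run a command under an exclusive,
# non-blocking lock when flock exists; otherwise run it without a lock.
run_with_lock() {
    local lockfile="$1"
    local command_to_execute="$2"
    if command -v flock >/dev/null 2>&1; then
        flock -x -n "$lockfile" -c "$command_to_execute"
    else
        echo "flock not found; running without a file lock" >&2
        sh -c "$command_to_execute"
    fi
}

# Usage with a throwaway lock file and a trivial command.
demo_lockfile="$(mktemp)"
status=0
run_with_lock "$demo_lockfile" 'exit 0' || status=$?
echo "status: $status"
rm -f "$demo_lockfile"
```

Either branch returns the exit status of the command it ran, so the caller's existing status handling would not need to distinguish between the two paths, apart from flock's status 1 when the lock is already held.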


3 participants